
snapshots: lthash #5954

Open · cali-jumptrading wants to merge 1 commit into main from cali/snapshot-tiles-lthash

Conversation

@cali-jumptrading (Contributor) commented Aug 6, 2025

Adds snapshot hashing tiles. They are disabled by default but enabled in CI to verify the snapshot lthash in ledgers. We plan to enable them by default once their performance can keep up with download speed.

@cali-jumptrading force-pushed the cali/snapshot-tiles-lthash branch 5 times, most recently from 1962ac8 to 40a602b on August 6, 2025 23:02
@cali-jumptrading marked this pull request as ready for review on August 7, 2025 15:22
@cali-jumptrading force-pushed the cali/snapshot-tiles-lthash branch 2 times, most recently from cf0495e to fd5559d on August 7, 2025 17:42
@cali-jumptrading marked this pull request as draft on August 7, 2025 19:02
@cali-jumptrading force-pushed the cali/snapshot-tiles-lthash branch 21 times, most recently from a634319 to 766d347 on August 9, 2025 06:03
@cali-jumptrading force-pushed the cali/snapshot-tiles-lthash branch 2 times, most recently from e0b3efe to 2b8601c on August 13, 2025 15:17
@topointon-jump self-requested a review on August 13, 2025 16:04
@cali-jumptrading force-pushed the cali/snapshot-tiles-lthash branch from 2b8601c to 878a1ac on August 13, 2025 16:28
Comment on lines +701 to +707
# How many snapshot hash tiles to run. The snapshot hash tiles are
# responsible for verifying the hash of all accounts in the loaded
# snapshot via lthash (lattice hashing). Currently set to 0 by
# default because it is too slow to run in the full client. TODO:
# enable snapshot hash tiles in the full client and update this
# comment.
hash_tile_count = 0
Contributor commented:

Could we make it clear that if the hash tile count is 0, no hashing will be done? Wondering if we should instead have a snapshots.verify flag to enable/disable the hashing, rather than having hash_tile_count = 0 imply no hashing.

Comment on lines +174 to +175
/* TODO: cache the previous lthash somewhere in funk to avoid
recalculating hash. */
Contributor commented:

We considered doing this... it would mean an additional 2048 bytes per account, which is probably not ideal. This storage-compute tradeoff is something we're exploring for transaction execution as well, and we should revisit it after the hashing changes and database changes are done.
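(For scale: at 2048 bytes each, caching an lthash for an illustrative 100 million accounts would add roughly 100e6 × 2048 B ≈ 205 GB of storage.)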

@cali-jumptrading linked an issue on Aug 13, 2025 that may be closed by this pull request
@cali-jumptrading force-pushed the cali/snapshot-tiles-lthash branch 5 times, most recently from 91db563 to 40bb2e2 on August 13, 2025 23:18
@cali-jumptrading force-pushed the cali/snapshot-tiles-lthash branch from 40bb2e2 to 5f5ce80 on August 15, 2025 21:23
@@ -698,6 +698,14 @@ user = ""
# requests.
sign_tile_count = 2

# How many snapshot hash tiles to run. The snapshot hash tiles are
@mmcgee-jump (Contributor) commented Aug 19, 2025

A few things:

  • hash_tile_count is very vague and could refer to anything. Be more specific; in this case the convention is <tilename>_tile_count, so snaplt_tile_count?
  • Allowing someone to turn this off is probably not desirable, and we should just make this config production-ready now. I'd suggest something like:
Suggested change
# How many snapshot hash tiles to run. The snapshot hash tiles are
[layout]
# How many snapshot lthash tiles to run. Snapshot hash tiles verify
# the contents of accounts in the loaded snapshot, in parallel. Too
# few and snapshot loading might be delayed. Once the hash of the
# accounts in a snapshot is calculated, if it does not match, the
# validator will abort with an error.
snaplt_tile_count = 1
[development]
[snapshots]
# Set to true to disable verification of the lthash in the
# loaded snapshot. This is not safe or supported in production
# and should only be used for development and testing purposes.
#
# If lthash verification is disabled, the validator will not
# start any lthash tiles and the value of `snaplt_tile_count`
# in the layout will be ignored.
disable_lthash_verification = false

Once you have this, you can:

  • Check that snaplt_tile_count > 0
  • Check that the development option is not enabled when running against a production cluster (same as some other options, e.g. we check this for --no-sandbox).
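
A minimal sketch of those two checks, assuming hypothetical field and helper names that follow the suggested config above (the real struct layout and production-cluster check may differ):

/* Reject a zero tile count: with the development escape hatch above,
   there is no legitimate reason to run zero snaplt tiles. */
if( FD_UNLIKELY( !config->firedancer.layout.snaplt_tile_count ) )
  FD_LOG_ERR(( "[layout.snaplt_tile_count] must be at least 1" ));

/* Refuse the development option against a production cluster, same as
   existing checks for options like --no-sandbox.
   is_production_cluster() is a hypothetical helper. */
if( FD_UNLIKELY( config->development.snapshots.disable_lthash_verification
                 && is_production_cluster( config ) ) )
  FD_LOG_ERR(( "[development.snapshots.disable_lthash_verification] is not safe or supported in production" ));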

@@ -201,6 +202,7 @@ fd_topo_initialize( config_t * config ) {
ulong bank_tile_cnt = config->layout.bank_tile_count;
ulong exec_tile_cnt = config->firedancer.layout.exec_tile_count;
ulong writer_tile_cnt = config->firedancer.layout.writer_tile_count;
ulong hash_tile_cnt = config->firedancer.layout.hash_tile_count;
@mmcgee-jump (Contributor) commented Aug 19, 2025

Suggested change
ulong hash_tile_cnt = config->firedancer.layout.hash_tile_count;
ulong snaplt_tile_cnt = config->firedancer.layout.snaplt_tile_count;

}
case FD_SNAPSHOT_HASH_MSG_ACCOUNT_HDR: {
FD_TEST( ctx->state==FD_SNAPHS_STATE_HASHING );
fd_snapshot_account_t * account = (fd_snapshot_account_t *)fd_chunk_to_laddr_const( ctx->in.wksp, chunk );
@mmcgee-jump (Contributor) commented Aug 19, 2025

Could be useful to verify

FD_TEST( ctx->acc_data_sz==0UL );

}
case FD_SNAPSHOT_HASH_MSG_ACCOUNT_DATA: {
FD_TEST( ctx->state==FD_SNAPHS_STATE_HASHING );
fd_memcpy( ctx->acc_data + ctx->acc_data_sz,
Contributor commented:

Looks like the lthash calculation is just a series of blake3 appends, so you can do it unbuffered / without memcpy by just streaming the account data into blake3 here?
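
A rough sketch of that unbuffered variant, assuming Firedancer's incremental fd_blake3 API (fd_blake3_init / fd_blake3_append / fd_blake3_fini); ctx->blake3 as a per-account hasher state and sz as the fragment size are assumptions here, not the PR's actual code:

case FD_SNAPSHOT_HASH_MSG_ACCOUNT_DATA: {
  FD_TEST( ctx->state==FD_SNAPHS_STATE_HASHING );
  /* Stream the fragment straight into the running blake3 state
     instead of accumulating it in ctx->acc_data via fd_memcpy. */
  fd_blake3_append( ctx->blake3,
                    fd_chunk_to_laddr_const( ctx->in.wksp, chunk ),
                    sz );
  ctx->acc_data_sz += sz; /* still tracked to know when the account is complete */
  break;
}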

uchar * manifest_lthash = ctx->full ? ctx->hash_info.full_lthash.bytes : ctx->hash_info.incremental_lthash.bytes;
fd_memcpy( manifest_lthash, manifest->accounts_lthash, sizeof(fd_lthash_value_t) );
} else {
/* disable lthash verification if there's no lthash in the
Contributor commented:

You should probably just abort with an error here, or trigger a malformed snapshot.
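
In other words, something like this instead of silently disabling verification (sketch only; FD_LOG_ERR logs the message and terminates the validator):

} else {
  /* Manifest carries no accounts lthash: fail loudly rather than
     skip verification. */
  FD_LOG_ERR(( "snapshot manifest does not contain an accounts lthash" ));
}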

@@ -106,6 +106,7 @@ extern fd_topo_run_tile_t fd_tile_shredcap;
extern fd_topo_run_tile_t fd_tile_snaprd;
extern fd_topo_run_tile_t fd_tile_snapdc;
extern fd_topo_run_tile_t fd_tile_snapin;
extern fd_topo_run_tile_t fd_tile_snaphs;
Contributor commented:

Also add to main.c in firedancer

hash from the running lthash. */
if( FD_LIKELY( ctx->hash_info.enabled ) ) {
/* Send the existing account to the hash tile */
/* TODO: cache the previous lthash somewhere in funk to avoid
Contributor commented:

This is extremely rare and not worth the memory. I'd remove the TODO; what you have is good.

fd_topo_obj_t * replay_manifest_dcache = fd_topob_obj( topo, "dcache", "replay_manif" );
fd_pod_insertf_ulong( topo->props, 2UL << 30UL, "obj.%lu.data_sz", replay_manifest_dcache->id );
fd_pod_insert_ulong( topo->props, "manifest_dcache", replay_manifest_dcache->id );
FOR(hash_tile_cnt) fd_topob_link( topo, "snapin_hsh", "snapin_hsh", 128UL, sizeof(fd_snapshot_existing_account_t), 1UL );
Contributor commented:

The convention for this kind of message passing is a single snapin_snaph link (not one per tile). You publish all frags on the link, and the snaphs tiles select which ones they want to process in the before_frag callback.
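
A sketch of that pattern, assuming the producer packs the message type into the low bits of sig and the destination tile index into the high bits (the packing scheme, ctx_t, and the tile_idx field are illustrative, not the PR's actual code):

/* Runs in every snaphs tile for each frag on the shared link.
   Returning nonzero tells stem to skip the frag. */
static inline int
before_frag( ctx_t * ctx,
             ulong   in_idx,
             ulong   seq,
             ulong   sig ) {
  (void)in_idx; (void)seq;
  ulong msg = sig & 0xffffffffUL; /* message type in low bits       */
  ulong dst = sig>>32;            /* destination tile in high bits  */
  if( FD_UNLIKELY( msg==FD_SNAPSHOT_HASH_MSG_RESET ) ) return 0; /* control frags: every tile processes */
  return (int)(dst!=ctx->tile_idx); /* account frags: owning tile only */
}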


if( FD_LIKELY( ctx->hash_info.enabled ) ) {
for( ulong i=0UL; i<ctx->hash_info.num_hash_tiles; i++ ) {
fd_stem_publish( ctx->stem, FD_SNAPIN_HSH_IDX + i, FD_SNAPSHOT_HASH_MSG_RESET, 0UL, 0UL, 0UL, 0UL, 0UL );
Contributor commented:

See the comment in the topology; you just need one link here.


fd_lthash_zero( &ctx->hash_info.full_lthash );
fd_lthash_zero( &ctx->hash_info.incremental_lthash );
fd_lthash_zero( &ctx->hash_info.result_lthash );
}

#define STEM_BURST 2UL /* For control fragments, one acknowledgement, and one malformed message */
Contributor commented:

No longer correct?

ctx->funk_txn = fd_funk_txn_prepare( ctx->funk, ctx->funk_txn, &incremental_xid, 0 );
ctx->full = 0;
ctx->state = FD_SNAPIN_STATE_LOADING;
ctx->funk_txn = fd_funk_txn_prepare( ctx->funk, ctx->funk_txn, &incremental_xid, 0 );
Contributor commented:

extra whitespace?

/* A pending ack is waiting for lthash results to come back from
the hashing tiles. Once these hashes come back, the ack can be
sent. */
if( FD_LIKELY( ctx->hash_info.received_lthashes==ctx->hash_info.num_hash_tiles ) ) {
Contributor commented:

Why in after_credit, and not in after_frag when you receive the lthash?
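
For reference, handling it in after_frag might look roughly like this (FD_SNAPSHOT_HASH_MSG_RESULT, the incoming lthash pointer, and send_pending_ack() are hypothetical names; fd_lthash_add accumulates one lthash value into another):

case FD_SNAPSHOT_HASH_MSG_RESULT: {
  /* Fold this hashing tile's result into the running total. */
  fd_lthash_add( &ctx->hash_info.result_lthash, lthash );
  /* Send the ack the moment the last hashing tile reports in,
     rather than polling the count from after_credit. */
  if( FD_UNLIKELY( ++ctx->hash_info.received_lthashes==ctx->hash_info.num_hash_tiles ) )
    send_pending_ack( ctx );
  break;
}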


fd_lthash_value_t * ref_lthash = ctx->full ? &ctx->hash_info.full_lthash : &ctx->hash_info.incremental_lthash;
if( FD_UNLIKELY( memcmp( ref_lthash, &ctx->hash_info.result_lthash, sizeof(fd_lthash_value_t) ) ) ) {
FD_LOG_WARNING(( "calculated accounts lthash %s does not match accounts lthash %s in snapshot manifest",
Contributor commented:

I think you have to exit with an error here, because we already told replay the snapshot was done loading. Otherwise you need to design a reset mechanism for replay and all other tiles as well.
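
Concretely, that suggestion amounts to replacing the warning with something like the following (FD_LOG_ERR logs and exits the process):

if( FD_UNLIKELY( memcmp( ref_lthash, &ctx->hash_info.result_lthash, sizeof(fd_lthash_value_t) ) ) )
  FD_LOG_ERR(( "calculated accounts lthash does not match accounts lthash in snapshot manifest" ));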

@mmcgee-jump (Contributor) left a comment

.

Successfully merging this pull request may close these issues: verify snapshot lthash

3 participants